Journal of the American Medical Informatics Association
Oxford University Press (OUP)
All preprints, ranked by how well they match Journal of the American Medical Informatics Association's content profile, based on 61 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Basu, S.; Baum, A.
Background: Clinicians in care management programs are often in low supply relative to patient demand, especially in US Medicaid programs, and must simultaneously address clinical risk, time efficiency, and patients' social needs. Many studies have shown that large language models may assist in tasks such as summarizing patient care and generating care plans; yet these studies also show that different objectives given to agents often conflict and produce problems for safety, efficiency, and equity. We tested whether and to what degree game-theoretic approaches (a Nash bargaining framework) can produce care plans that advance multiple objectives across multiple language models, applying data from a real-world Medicaid cohort. Methods: We conducted two studies in a cohort of 5,148 activated Medicaid care management patients (69.9% female; 45.7% Black or African American; mean age 40.9 years) enrolled in Virginia and Washington. A retrospective evaluation applied five deterministic strategies to the full cohort to characterize multi-objective trade-offs. A pre-registered controlled paired experiment (N = 200) assigned each patient one Nash-orchestrated multi-agent plan and one compute-matched sequential self-critique plan, generated by locally hosted open-source models (DeepSeek-R1 8B; Llama 3.1 8B) with no patient data leaving local infrastructure. Pre-specified outcomes were Safety, Efficiency, Equity, and Composite (mean of the three), each scored 0-1. Reporting follows CONSORT 2010 and STROBE. Results: Nash orchestration produced a Composite score of 0.755 (95% CI 0.751-0.760) versus 0.742 (95% CI 0.739-0.746) for the compute-matched baseline; the paired difference was 0.013 (95% CI 0.008-0.019; p = 6.20 x 10-). Safety and Efficiency paired differences were small-to-moderate in effect size (Cohen's d = 0.327 and 0.543, respectively) with confidence intervals excluding zero. The Equity paired difference was 0.000 (95% CI -0.015 to 0.014; p = 0.987). Conclusions: Role-specialized, Nash-orchestrated multi-agent language models produced measurably better Safety and Efficiency care plan quality than a compute-matched baseline under data-residency constraints. The null Equity result demonstrates that multi-objective role specialization does not automatically address equity--equity requires explicit design attention beyond composite weighting--with direct implications for responsible AI deployment in Medicaid care management. Author Summary: Care management programs for Medicaid patients need to address multiple goals at once: covering clinical risks, prioritizing the most impactful interventions, and recognizing the social barriers that affect whether patients can follow through on care plans. Prior research shows that automation tools powered by a single AI model tend to optimize for one of these goals at a time, sacrificing the others. We tested whether organizing several specialized AI agents -- each focused on a different goal -- and then combining their recommendations through a mathematical framework called Nash bargaining could produce better overall care plans for a real Medicaid population. We found that this multi-agent approach produced care plans that the AI judge rated as meaningfully safer and more efficient than plans generated by a single AI model using the same total amount of computation.
However, the multi-agent approach did not produce plans that were more equitable in addressing patients' social needs, suggesting that equity requires more direct attention as a design target rather than emerging from multi-objective combination alone. All AI inference was performed on locally hosted computers, with no patient information sent to outside services, reflecting the privacy requirements of real-world Medicaid care management programs.
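A minimal sketch of the kind of Nash bargaining selection the abstract describes: given candidate care plans scored on Safety, Efficiency, and Equity, choose the plan maximizing the Nash product of gains over a disagreement point. The scores, disagreement point, and function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nash_bargaining_select(candidate_scores, disagreement=(0.0, 0.0, 0.0)):
    """Pick the candidate plan that maximizes the Nash product of objective gains.

    candidate_scores: (n_candidates, 3) array of Safety, Efficiency, Equity
    scores in [0, 1]; `disagreement` is an assumed fallback utility per objective.
    """
    scores = np.asarray(candidate_scores, dtype=float)
    gains = np.clip(scores - np.asarray(disagreement, dtype=float), 1e-9, None)
    nash_product = gains.prod(axis=1)  # product of gains across the three objectives
    return int(np.argmax(nash_product)), nash_product

# Example: three candidate care plans scored by three role-specialized agents
plans = [(0.80, 0.70, 0.60), (0.75, 0.78, 0.65), (0.90, 0.55, 0.50)]
best, products = nash_bargaining_select(plans)
print(best, products.round(3))   # index of the selected plan and the Nash products
```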
Lian, Y.; Jiang, X.; Long, Q.
Objective: Electronic health records (EHRs) collected from diverse healthcare institutions offer a rich and representative data source for clinical research. Federated learning enables analysis of these distributed data without sharing sensitive patient-level information, preserving privacy. However, missing data remain a major challenge and can introduce substantial bias if not properly addressed. Very few distributed imputation methods currently exist, and they fail to account for two critical aspects of EHR data: correlation within sites and variability across sites. We aim to fill this important methodological gap. Methods: We propose Distributed Mixed Model-based Multiple Imputation (D3MI), a novel federated imputation method designed to reduce bias in distributed EHRs. D3MI integrates the strengths of federated learning techniques, statistical learning methods for correlated data, and multilevel imputation algorithms to explicitly account for both within-site correlation and between-site heterogeneity using site-specific random effects. It preserves privacy by avoiding sharing raw data and offers communication and computational efficiency. Results: Through extensive simulation studies, we demonstrate that D3MI outperforms state-of-the-art distributed imputation methods in both accuracy and consistency. We further demonstrate the use of D3MI in a real-world EHR case study involving incomplete and clustered data from participating hospitals in the Georgia Coverdell Acute Stroke Registry. Conclusion: By explicitly modeling the complex structure of distributed EHR data, D3MI addresses key limitations of existing approaches. It provides a powerful and efficient solution for handling missing data in distributed and privacy-sensitive settings and enhances the rigor and reproducibility of collaborative clinical research.
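For readers unfamiliar with multilevel imputation, a minimal sketch of the core idea (a random-intercept model per site used to draw imputations) follows. It is a centralized, single-variable toy using statsmodels, not the federated D3MI algorithm; the column names, missingness rate, and imputation draw are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy multi-site data with a partially missing lab value. In D3MI the sites
# would never pool raw rows; this centralized fit is for illustration only.
rng = np.random.default_rng(0)
n, n_sites = 300, 5
df = pd.DataFrame({"site": rng.integers(0, n_sites, n),
                   "age": rng.normal(60, 10, n)})
site_effect = rng.normal(0, 2, n_sites)
df["lab"] = 5 + 0.05 * df["age"] + site_effect[df["site"]] + rng.normal(0, 1, n)
df.loc[rng.random(n) < 0.2, "lab"] = np.nan          # ~20% missing at random

# Random-intercept (mixed) model fitted on the observed rows
obs = df.dropna(subset=["lab"])
fit = smf.mixedlm("lab ~ age", obs, groups=obs["site"]).fit()

# One stochastic imputation: fixed effects + estimated site intercept + noise
miss = df["lab"].isna()
fixed = fit.predict(df.loc[miss])                     # fixed-effects prediction
site_re = pd.Series({g: float(v.iloc[0]) for g, v in fit.random_effects.items()})
noise = rng.normal(0, np.sqrt(fit.scale), miss.sum())
df.loc[miss, "lab"] = fixed + df.loc[miss, "site"].map(site_re).fillna(0.0) + noise
```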
Wagholikar, K. B.; Pacheco, J. A.; Gordon, A. S.; Khan, A.; Khales, B. N.; Benoit, B.; Kerman, B. J.; Weng, C.; Ta, C.; Prows, C. A.; Johnson, R.; Roden, D. M.; Crosslin, D.; McNally, E. M.; Karlson, E. W.; Mentch, F.; Jarvik, G. P.; Wiesner, G. L.; Hakonarson, H.; Cimino, J. J.; Thayer, J. G.; Smoller, J. W.; Linder, J. E.; Connolly, J.; Peterson, J. F.; Cortopassi, J.; Kiryluk, K.; Hamed, M.; Maradik, M.; Puckelwartz, M. J.; Naderian, M.; Walton, N.; Limdi, N.; Maripuri, D. P.; Walunas, T.; Gainer, V.; Luo, Y.; Liu, C.; Kenny, E. E.; Espinoza, A.; Rowley, R.; Wei, W.-Q.; Murphy, S.
Pragmatic clinical trials (PCTs) evaluate interventions in real-world settings, often using electronic health records (EHRs) for efficient data collection. We report on the challenges in performing EHR analysis of health-care provider orders in a PCT within the eMERGE consortium, which investigates the impact of reporting genome-informed risk assessments (GIRA) to over 25,000 patients across 10 academic medical centers. Clinical informaticians conducted a landscape analysis to identify approaches for evaluating the outcomes of GIRA reporting through the EHR. Of 98 identified outcomes, 54 (55.1%) were determined to be difficult to extract because they involved provider orders, which are typically documented in free text or proprietary formats within the EHR and only mapped to standardized codes after the service is completed. These findings highlight a critical barrier in using EHRs to support PCTs. The authors recommend closer collaboration between clinicians and informaticians, improved EHR systems that support standardized order entry, and future use of machine learning to automate analysis of provider behavior in clinical trials.
Lett, E.; Shahbandegan, S.; Barak-Corren, Y.; Fine, A.; La Cava, W. G.
Background: Fair clinical prediction models are crucial for achieving equitable health outcomes. Recently, intersectionality has been applied to develop fairness algorithms that address discrimination among intersections of protected attributes (e.g., Black women rather than Black persons or women separately). Still, the majority of the medical AI literature applies marginal de-biasing approaches, which constrain performance across one or many isolated patient attributes. We investigate the extent to which this modeling decision affects model equity and performance in a well-defined use case in emergency medicine. Methods: The study focused on predicting emergency room admissions using electronic health record data from two large U.S. hospitals, Beth Israel Deaconess Medical Center (MIMIC-IV-ED, n=160,016) and Boston Children's Hospital (BCH, n=22,222), covering both adult and pediatric populations. In a comprehensive experiment over fairness definitions and modeling methods, we compared the performance of single- and multi-attribute marginal de-biasing approaches to intersectional de-biasing approaches. Results: Intersectional de-biasing produces greater reductions in subgroup calibration error (MIMIC-IV: 21.2%; BCH: 27.2%) than marginal de-biasing (MIMIC-IV: 10.6%; BCH: 22.7%), and also lowers subgroup false negative rates on MIMIC-IV an additional 3.5% relative to marginal de-biasing. These fairness gains were achieved without a significant decrease in model accuracy between baseline and intersectionally de-biased models (MIMIC-IV: AUROC=0.85±0.00, both models; BCH: AUROC=0.88±0.01 vs 0.87±0.01). Intersectional de-biasing more effectively lowered subgroup calibration error and FNRs in low-prevalence groups in both datasets compared to other de-biasing conditions. Conclusion: Intersectional de-biasing better mitigates performance disparities across intersecting groups compared to marginal approaches for emergency admission prediction. These strategies meaningfully reduce group-specific error rates without compromising overall accuracy. These findings highlight the importance of considering interacting aspects of patient identity in model development, and suggest that intersectional de-biasing would be a promising gold standard for ensuring equity in clinical prediction models.
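A minimal sketch of one of the metrics reported above: subgroup calibration error computed within intersectional groups formed by crossing two attributes. The binning scheme, group labels, and toy data are assumptions, not the study's evaluation code.

```python
import numpy as np
import pandas as pd

def subgroup_ece(y_true, y_prob, groups, n_bins=10):
    """Expected calibration error computed separately within each subgroup."""
    df = pd.DataFrame({"y": y_true, "p": y_prob, "g": groups})
    edges = np.linspace(0, 1, n_bins + 1)
    out = {}
    for g, sub in df.groupby("g"):
        idx = np.clip(np.digitize(sub["p"], edges) - 1, 0, n_bins - 1)
        ece = 0.0
        for b in range(n_bins):
            in_bin = idx == b
            if in_bin.any():
                ece += in_bin.mean() * abs(sub["y"][in_bin].mean() - sub["p"][in_bin].mean())
        out[g] = round(ece, 3)
    return out

# Toy example: intersectional groups formed by crossing two attributes
rng = np.random.default_rng(1)
race = rng.choice(["A", "B"], 1000)
sex = rng.choice(["F", "M"], 1000)
p = rng.uniform(0, 1, 1000)          # predicted risks
y = rng.binomial(1, p)               # outcomes drawn from those risks (well calibrated)
print(subgroup_ece(y, p, [f"{r}|{s}" for r, s in zip(race, sex)]))
```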
Horng, S.; Joseph, J.; Calder, S.; Stevens, J. P.; O'Donoghue, A. L.; Safran, C.; Nathanson, L. A.; Leventhal, E.
Importance: Electronic health records (EHRs) allow teams of clinicians to simultaneously care for patients, but an unintended consequence can be duplicate ordering of tests and medications. Objective: We asked whether a simple visual aid would reduce duplicate ordering of tests and medications for busy teams of clinicians in our emergency department by placing a red highlight around the checkbox of a computer-based order if it had previously been ordered. Design: We performed an interrupted time series analysis of all patient visits 1 year before and 1 year after the intervention. Significance testing was performed using a negative binomial regression with Newey-West standard errors, correcting for patient-level variables and environmental variables that might be associated with duplicate orders. Setting: The emergency department of an academic hospital in Boston, MA with 55,000 visits annually. Participants: 184,722 consecutive emergency department patients. Exposure: If an order had previously been placed during that ED visit, we cued the user by showing a red highlight around the checkbox of that order. Main Outcome: Number of unintentional duplicate orders. Results: After deployment of the non-interrupting nudge, the rate of unintentional duplicates for laboratory orders decreased 49% (incidence rate ratio 0.51, 95% CI 0.45-0.59) and for radiology orders decreased with an incidence rate ratio of 0.60 (0.44-0.82). There was no change in unintentional medication duplicate orders. We estimated that the nudge eliminated 17,936 clicks in our EHR. Conclusions and Relevance: Passive visual cues that provide just-in-time decision support are effective, not disruptive of workflow, and may decrease alert fatigue in busy clinical environments. Key Points: Question: Can a simple visual aid reduce duplicate ordering in an electronic health record? Findings: In this interrupted time series, the rate of unintentional duplicates for laboratory orders decreased 49% and for radiology orders decreased 40%. There was no change in unintentional medication duplicate orders. We estimated that the nudge eliminated 17,936 clicks in our EHR. Meaning: Quality improvement often relies on changing clinician behavior. We believe guiding clinicians to a right action is better than telling the clinician they have already made an error. Our approach will help reduce alert fatigue and lessen clinician complaints about EHRs.
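A minimal sketch of the analysis family described above (negative binomial regression of weekly duplicate-order counts with Newey-West/HAC standard errors), using statsmodels on toy data. Variable names, the lag choice, and the simulated counts are assumptions; the study's actual covariates are not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy weekly counts of duplicate orders, with a level drop after the nudge
rng = np.random.default_rng(2)
weeks = np.arange(104)
post = (weeks >= 52).astype(int)                 # nudge goes live at week 52
mu = np.exp(3.0 + 0.002 * weeks - 0.6 * post)    # true rate falls ~45% post-intervention
y = rng.poisson(mu)                              # count outcome (Poisson stand-in)

X = sm.add_constant(pd.DataFrame({"time": weeks, "post": post}))
model = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0))
# Newey-West style (HAC) covariance absorbs residual serial correlation
res = model.fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print("IRR for post period:", float(np.exp(res.params["post"])))
```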
Holt, A. W.; Smalheiser, N. R.
We have developed a free, public web-based tool, Trials to Publications (https://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/TrialPubLinking/trial_pub_link_start.cgi), which employs a machine learning model to predict which publications are likely to present clinical outcome results from a given registered trial in ClinicalTrials.gov. The tool has reasonably high precision, yet in a recent study we found that when registry mentions are not explicitly listed in metadata, textual clues (in title, abstract or other metadata) could identify only roughly 1/3-1/2 of the publications with high confidence. This finding has led us to expand the scope of the tool to search for explicit mentions of registry numbers located within the full text of publications. We have now retrieved ClinicalTrials.gov registry number mentions (NCT numbers) from the full text of 3 online biomedical article collections (open access PubMed Central, EuroPMC, and OpenAlex), as well as retrieving biomedical citations that are mentioned within the ClinicalTrials.gov registry itself. These methods greatly increase the recall of identifying linked publications, and should assist those carrying out evidence syntheses as well as those studying the meta-science of clinical trials. Highlights:
- Those conducting systematic reviews, other evidence syntheses, and meta-science analyses often need to examine published evidence arising from clinical trials. Finding publications linked to a given trial is a difficult manual process, but several automated tools have been developed. The Trials to Publications tool is the only free, public, currently maintained web-based tool that predicts publications linked to a given trial in ClinicalTrials.gov.
- A recent analysis indicated that the Trials to Publications tool has good precision but limited recall. In the present paper, we greatly enhanced the recall by identifying registry mentions in the full text of articles indexed in open access PubMed Central, EuroPMC, and OpenAlex.
- The tool now has reasonably comprehensive coverage of registry mentions, both for identifying articles that present trial outcome results and for other types of articles that are linked to, or that discuss, the trials. This should greatly save effort during web searches of the literature.
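The full-text retrieval step rests on matching registry identifiers; a minimal sketch of that pattern match is shown below. The regex and helper name are assumptions (ClinicalTrials.gov identifiers are "NCT" followed by eight digits), not the tool's source code.

```python
import re

# ClinicalTrials.gov identifiers are "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b", flags=re.IGNORECASE)

def extract_nct_ids(text: str) -> list[str]:
    """Return de-duplicated, uppercased NCT identifiers found in a document."""
    return sorted({m.group(0).upper() for m in NCT_PATTERN.finditer(text)})

print(extract_nct_ids("Registered as nct01234567; see also NCT07654321."))
# ['NCT01234567', 'NCT07654321']
```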
Al-Garadi, M.
IMPORTANCE: Although angiotensin-converting enzyme inhibitors (ACEIs) and angiotensin receptor blockers (ARBs) are recommended for people with chronic kidney disease (CKD), they remain underused. Barriers to adherence, such as adverse effects or patient refusal, are frequently embedded within unstructured clinical narratives and are therefore inaccessible to structured data analytics. Scalable natural language processing (NLP) approaches are needed to identify these barriers and support guideline-concordant care. OBJECTIVE: To develop and evaluate an NLP model capable of identifying documented reasons for ACEI/ARB non-use within clinical notes of people with CKD in the Veterans Affairs (VA) healthcare system. DESIGN, SETTING, AND PARTICIPANTS: This retrospective study analyzed electronic health record data from 2005 to 2024 including people aged 18 to 80 years with CKD, defined by an estimated glomerular filtration rate (eGFR) of 20-60 mL/min/1.73 m2 and presence of albuminuria, across multiple VA medical centers. NLP models were trained on 1,025 manually annotated notes and further augmented with 4,600 synthetic examples generated through schema-guided large language model prompting. MAIN OUTCOMES AND MEASURES: The primary outcome was model performance in identifying notes containing at least one documented reason for ACEI/ARB non-use, evaluated using F1-score, precision, and recall. Secondary outcomes included model learning curve analyses and the effect of synthetic data augmentation on classification performance. RESULTS: The most common documented reasons for ACEI/ARB non-use were acute kidney injury (29.6%), increased creatinine (12.4%), cough (11.2%), and hypotension-related symptoms (11.1%). Across modeling approaches, training with synthetic data augmentation improved detection of notes containing reasons for non-use. Performance gains were statistically significant across all models (McNemar test, P < .05), with the random forest model using Nomic embeddings achieving the highest performance (F1 score, 0.79; 95% CI, 0.68-0.90). CONCLUSIONS AND RELEVANCE: We identified documented reasons for ACEI/ARB non-use (including both failures to initiate therapy and discontinuation after prior use) from unstructured text using an NLP method that does not require massive, expensive computing at inference time. By augmenting training data with schema-guided synthetic notes, we achieved robust, privacy-preserving performance within an NLP framework. This approach may support scalable clinical decision support systems to promote guideline-concordant prescribing.
Gauthier, L. W.; Willems, M.; Chatron, N.; Cenni, C.; Meyer, P.; Ruault, V.; Wells, C.; Sabbagh, Q.; Genevieve, D.; Yauy, K.
Background: Precision medicine requires accurate phenotyping and data sharing, particularly for rare diseases. However, sharing medical reports across language barriers is challenging. Alternatively, inconsistent and incomplete clinical summaries provided by physicians using the Human Phenotype Ontology (HPO) can lead to a loss of clinical information. Methods: To assess the feasibility and risk of using deep learning methods to translate, de-identify, and summarize medical reports, we developed an open-source deep learning multi-language software in line with health data privacy. We conducted a non-inferiority clinical trial using deep learning methods to de-identify protected health information (PHI), targeting a minimum sensitivity of 90% and specificity of 75%, and to summarize non-English medical reports in HPO format, aiming for a sensitivity of 75% and specificity of 90%. Results: From March to April 2023, we evaluated 50 non-English medical reports from 8 physicians and 12 different groups of diseases, which included neurodevelopmental disorders, congenital disorders, fetal pathology, and oncology. Reports contained a median of 15 PHI and 7 HPO terms. The deep learning method achieved a sensitivity of 99% and a specificity of 87% in de-identification, and a sensitivity of 78% and a specificity of 92% in summarizing medical reports, reporting an average of 6.6 HPO terms per report, which is equivalent to the number of HPO terms usually provided by physicians in databases (6.8 in PhenoDB). Conclusions: De-identification and summarization of non-English medical reports using deep learning methods showed non-inferior performance, providing insights on AI usage to facilitate precision medicine. Graphical abstract: Illustration of the non-inferiority trial for de-identification and summarization of non-English medical reports and main statistical performances.
Miao, B. Y.; Binvignat, M.; Garcia-Agundez, A.; Bravo, M.; Williams, C. Y.; Miao, C. Q.; Alaa, A.; Rudrapatna, V. A.; Butte, A. J.; Schmajuk, G.; Yazdany, J.
Importance: Tumor necrosis factor inhibitors (TNFi) are widely used for autoimmune conditions. Despite their efficacy, many patients switch TNFis due to lack of efficacy, cost-related reasons, or adverse events. Understanding why switches occur is important, but requires extensive chart review. Objective: To determine whether large language models (LLMs) can automatically perform chart review, accurately identifying TNFi switching trajectories and reasons for switching in a large real-world cohort. Design: Observational study using de-identified electronic health record data (2012-2023). Medication orders and associated clinical notes for TNFi agents were extracted; at least 6 months of follow-up was required to ascertain switches. Setting: Single academic medical center (University of California, San Francisco). Participants: 9,187 patients (mean [SD] age, 39.9 [19.0] years; 57.1% female) who received ≥1 TNFi with adequate follow-up. Among these, 1,481 (16.1%) had ≥1 TNFi switch, 418 (4.5%) had ≥2 switches, and 150 (1.6%) had ≥3 switches. Exposures: Switching was defined as a change from one TNFi to a different TNFi at consecutive encounters. Main Outcomes and Measures: Using GPT-4, we extracted which TNFi was stopped or started, and reasons for switching: adverse event; drug resistance; insurance/cost; lack of efficacy; patient preference; other; unknown. Performance was compared with eight open-source LLMs, structured medication data, and expert annotations. Results: After applying inclusion criteria, 3,104 switches between different TNFi drugs in 2,112 patients were identified. GPT-4 achieved micro-F1 scores of 0.75 for stopped TNFi, 0.80 for started TNFi, and 0.83 for switch reason. Of all open-source models, Starling-7B-beta and Llama-3-8B offered the most competitive performance overall compared to GPT-4 and achieved similar win-loss ratios. The primary reason identified by GPT-4 was lack of efficacy (56.9%), followed by adverse events (13.5%) and insurance/cost (10.8%). Conclusions and Relevance: Both GPT-4 and locally deployable LLMs demonstrated potential in executing complex reasoning tasks, specifically identifying reasons for switching between TNF inhibitors. This finding suggests broader application in clinical research and documentation. Further research is needed to assess model performance across additional medication classes and patient populations. Key Points: Question: Can large language models (LLMs) identify TNF inhibitor (TNFi) switching trajectories and reasons from clinical notes? Findings: We used de-identified electronic health records from UCSF (University of California, San Francisco) for 9,187 patients who received ≥1 TNFi. GPT-4 achieved micro-F1 scores up to 0.830 in identifying reasons and specific TNFi starts/stops compared to clinical expert annotations, surpassing eight open-source LLMs. The best open-source models, Llama-3-8b-chat-hf and Starling-7B-beta, matched GPT-4 in determining which TNFi was started but had lower accuracy in identifying reasons for switching. Meaning: The LLMs evaluated in this study were capable of performing complex reasoning tasks in identifying reasons for switching between TNFi. Broader application could extend to other biologics, to other pharmacoepidemiology studies, and to chart summarization.
Strayer, N.; Vessels, T. J.; Choi, K. W.; Zhang, S.; Li, Y.; Sharber, B.; Hsi, R. S.; Bejan, C. A.; Bick, A. G.; Balko, J. M.; Johnson, D. B.; Wheless, L. E.; Wells, Q. S.; Shah, R. V.; Phillips, E. J.; Self, W. H.; Pulley, J. M.; Wilkins, C. H.; Chen, Q.; Hartert, T.; Savona, M. R.; Shyr, Y.; Roden, D. M.; Smoller, J. W.; Ruderfer, D. M.; Xu, Y.
Background: Electronic health records (EHR) are increasingly used for studying multimorbidities. However, concerns about accuracy, completeness, and EHRs being primarily designed for billing and administrative purposes raise questions about the consistency and reproducibility of EHR-based multimorbidity research. Methods: Utilizing phecodes to represent the disease phenome, we analyzed pairwise comorbidity strengths using a dual logistic regression approach and constructed multimorbidity as an undirected weighted graph. We assessed the consistency of the multimorbidity networks within and between two major EHR systems at local (nodes and edges), meso (neighboring patterns), and global (network statistics) scales. We present case studies to identify disease clusters and uncover clinically interpretable disease relationships. We provide an interactive web tool and a knowledge base combining data from multiple sources for online multimorbidity analysis. Findings: Analyzing data from 500,000 patients across the Vanderbilt University Medical Center and Mass General Brigham health systems, we observed a strong correlation in disease frequencies (Kendall's τ = 0.643) and comorbidity strengths (Pearson's ρ = 0.79). Consistent network statistics across EHRs suggest similar structures of multimorbidity networks at various scales. Comorbidity strengths and similarities of multimorbidity connection patterns align with the disease genetic correlations. Graph-theoretic analyses revealed a consistent core-periphery structure, implying efficient network clustering through threshold graph construction. Using hydronephrosis as a case study, we demonstrated the network's ability to uncover clinically relevant disease relationships and provide novel insights. Interpretation: Our findings demonstrate the robustness of large-scale EHR data for studying phenome-wide multimorbidities. The alignment of multimorbidity patterns with genetic data suggests the potential utility for uncovering shared biology of diseases. The consistent core-periphery structure offers analytical insights to discover complex disease interactions. This work also sets the stage for advanced disease modeling, with implications for precision medicine. Funding: VUMC Biostatistics Development Award, the National Institutes of Health, and the VA CSRD.
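A minimal sketch of how a pairwise comorbidity strength might be estimated with covariate-adjusted logistic regression in both directions and assembled into a weighted undirected graph (networkx). The toy phenotype columns, the covariate set, the averaging of the two coefficients, and the edge threshold are all assumptions, not the published dual-regression specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import networkx as nx
from itertools import combinations

# Toy binary phenotype matrix: one row per patient, one column per phecode
rng = np.random.default_rng(3)
n = 2000
pheno = pd.DataFrame({"age": rng.normal(55, 12, n)})
pheno["d1"] = rng.binomial(1, 0.20, n)
pheno["d2"] = rng.binomial(1, np.where(pheno["d1"] == 1, 0.50, 0.15))  # comorbid with d1
pheno["d3"] = rng.binomial(1, 0.20, n)                                  # independent

G = nx.Graph()
for a, b in combinations(["d1", "d2", "d3"], 2):
    # Fit the covariate-adjusted regression in both directions and average
    # the log-odds (a stand-in for the published dual-regression approach).
    fit_ab = smf.logit(f"{b} ~ {a} + age", data=pheno).fit(disp=0)
    fit_ba = smf.logit(f"{a} ~ {b} + age", data=pheno).fit(disp=0)
    strength = float(np.mean([fit_ab.params[a], fit_ba.params[b]]))
    if abs(strength) > 0.2:               # arbitrary threshold for drawing an edge
        G.add_edge(a, b, weight=round(strength, 2))

print(G.edges(data=True))                 # expect an edge between d1 and d2
```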
Bannett, Y.; Gunturkun, F.; Pillai, M.; Herrmann, J. E.; Luo, I.; Huffman, L. C.; Feldman, H. M.
Objective: To assess the accuracy of a large language model (LLM) in measuring clinician adherence to practice guidelines for monitoring side effects after prescribing medications for children with attention-deficit/hyperactivity disorder (ADHD). Methods: Retrospective population-based cohort study of electronic health records. The cohort included children aged 6-11 years with an ADHD diagnosis and >2 ADHD medication encounters (stimulants or non-stimulants prescribed) between 2015-2022 in a community-based primary healthcare network (n=1,247). To identify documentation of side effects inquiry, we trained, tested, and deployed an open-source LLM (LLaMA) on all clinical notes from ADHD-related encounters (ADHD diagnosis or ADHD medication prescription), including in-clinic/telehealth and telephone encounters (n=15,593 notes). Model performance was assessed using holdout and deployment test sets, compared to manual chart review. Results: The LLaMA model achieved excellent performance in classifying notes that contain side effects inquiry (sensitivity=87.2%, specificity=86.3/90.3%, area under the curve (AUC)=0.93/0.92 on holdout/deployment test sets). Analyses revealed no model bias in relation to patient age, sex, or insurance. Mean age (SD) at first prescription was 8.8 (1.6) years; patient characteristics were similar across patients with and without documented side effects inquiry. Rates of documented side effects inquiry were lower in telephone encounters than in-clinic/telehealth encounters (51.9% vs. 73.0%, p<0.01). Side effects inquiry was documented in 61% of encounters following stimulant prescriptions and 48% of encounters following non-stimulant prescriptions (p<0.01). Conclusions: Deploying an LLM on a variable set of clinical notes, including telephone notes, offered scalable measurement of quality of care and uncovered opportunities to improve psychopharmacological medication management in primary care.
Klang, E.; Glicksberg, B. S.; Gorenshtein, A.; Gavin, N.; Freeman, R.; Stump, L.; Charney, A. W.; Ting, D. S. W.; Omar, M.; Nadkarni, G.
Background: Large language models (LLMs) now power clinical agents that can plan, call tools, and write into electronic health records (EHRs). They are becoming actors, not assistants. Given known LLM faults, quality assurance is essential before clinical use. A key question is whether agents notice patient-identity errors or act indifferently. Methods: We created a record environment using publicly available MIMIC-IV real-world emergency department data. Agents were instructed to copy ICD-10 codes from visit headers into patient records using Extract and Store tools, with an option to record "UNKNOWN" if uncertain or to abstain. Each agent was presented with ten batched records from the same patient (clean version). We then tampered with one of the records and evaluated how the agent responded. We ran four separate batches: the clean baseline batch, a batch with one visit with a completely swapped header from another patient, a batch with one visit with a one-digit MRN change, and a batch with age shifted in one visit. Six models, both closed- and open-weight, completed 1.2 million tool calls to assess model performance. The endpoint was whether agents would identify when identity fields were inconsistent. Results: Agents frequently failed, copying codes into tampered charts. GPT-4.1 flagged mismatched headers as UNKNOWN in 17.4% of runs but never detected subtle faults. GPT-4.1-nano detected 4.4% of header swaps and <1% of MRN or age changes. GPT-5-chat never identified mismatches but omitted responses in 12.6% of cases. Other models rarely abstained. Subtle tampering passed almost entirely without detection. Conclusions: Clinical agents are often indifferent to inconsistencies in patient details. The central risk is misbinding, not miscoding. Safe deployment requires explicit identity verification, abstention when uncertain, and benchmarks that treat record integrity, not just accuracy, as a primary outcome.
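A minimal sketch of the kind of explicit identity verification the conclusion calls for: a guard that refuses to store codes when header and chart identity fields disagree, returning "UNKNOWN" instead. The field names and tool interface are hypothetical.

```python
def identity_fields_match(header: dict, record: dict,
                          fields=("mrn", "name", "dob")) -> bool:
    """True only when every identity field agrees between the visit header
    and the stored chart (field names are hypothetical)."""
    return all(str(header.get(f, "")).strip().lower() ==
               str(record.get(f, "")).strip().lower() for f in fields)

def store_codes(header: dict, record: dict, codes: list[str]) -> dict:
    """Guarded Store step: abstain with 'UNKNOWN' rather than write into a
    chart whose identity fields do not match the visit header."""
    if not identity_fields_match(header, record):
        return {"status": "UNKNOWN", "reason": "identity mismatch"}
    return {"status": "stored", "codes": codes}

visit = {"mrn": "1234567", "name": "Doe, Jane", "dob": "1980-01-02"}
chart = {"mrn": "1234568", "name": "Doe, Jane", "dob": "1980-01-02"}  # one-digit MRN change
print(store_codes(visit, chart, ["I10", "E11.9"]))   # {'status': 'UNKNOWN', ...}
```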
La Cava, W.; Lee, P. C.; Ajmal, I.; Ding, X.; Cohen, J. B.; Solanki, P.; Moore, J. H.; Herman, D. S.
Objective: Electronic health records (EHRs) can improve patient care by enabling systematic identification of patients for targeted decision support, but this requires scalable learning of computable phenotypes. To this end, we developed the feature engineering automation tool (FEAT) and assessed it in targeting screening for primary aldosteronism, an underdiagnosed and undertreated disease. Materials and Methods: We selected 1,199 subjects receiving longitudinal care in a large health system and classified them for hypertension (N=608), hypertension with unexplained hypokalemia (N=172), and apparent treatment-resistant hypertension (N=176) by chart review. We derived 331 features from EHR encounters, diagnoses, laboratories, medications, vitals, and notes. We modified FEAT to encourage model parsimony and compared its models' performance and interpretability to those of expert-curated heuristics and conventional machine learning. Results: FEAT models trained to replicate expert-curated heuristics had higher area under the precision-recall curve (AUPRC) than all other models (p < 0.001) except random forests and were smaller than all other models (p < 1e-6) except decision trees. FEAT models trained to predict chart review phenotypes exhibited similar AUPRC to penalized logistic regression while being simpler than all other models (p < 1e-6). For treatment-resistant hypertension, FEAT learned a six-feature, clinically intuitive model that demonstrated a positive predictive value of 0.70 and a sensitivity of 0.62 in held-out testing data. Discussion: FEAT learns computable phenotypes that approach the performance of expert-curated heuristics and conventional machine learning without sacrificing interpretability. Conclusion: By constructing accurate and interpretable computable phenotypes at scale, FEAT has the potential to facilitate systematic clinical decision support.
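A minimal sketch of the held-out evaluation metrics reported above (AUPRC plus positive predictive value and sensitivity at a fixed operating point), using scikit-learn on toy scores. The prevalence, score model, and threshold are assumptions.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

rng = np.random.default_rng(5)
y_true = rng.binomial(1, 0.15, 500)                                  # rare phenotype
y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.2, 500), 0, 1)    # toy model scores

auprc = average_precision_score(y_true, y_score)   # threshold-free summary (AUPRC)
y_pred = (y_score >= 0.5).astype(int)              # one fixed operating point
ppv = precision_score(y_true, y_pred)              # positive predictive value
sens = recall_score(y_true, y_pred)                # sensitivity
print(f"AUPRC={auprc:.2f}  PPV={ppv:.2f}  sensitivity={sens:.2f}")
```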
Obra, J. K.; Singh, C.; Watkins, K.; Feng, J.; Obermeyer, Z.; Kornblith, A. E.
Clinical decision instruments (CDIs) face an equity dilemma. On the one hand, they often reduce disparities in patient care through data-driven standardization of best practices. On the other hand, this standardization may itself inadvertently perpetuate bias and inequality within healthcare systems. Here, we quantify different measures of potential for implicit bias present in CDI development that can inform future CDI development. We find evidence for systematic bias in the development of 690 CDIs that underwent validation through various analyses: self-reported participant demographics are skewed--e.g. 73% of participants are White, 55% are male; investigator teams are geographically skewed--e.g. 52% in North America, 31% in Europe; CDIs use predictor variables that may be prone to bias--e.g. 13 CDIs explicitly use Race and Ethnicity; outcome definitions may further introduce bias--e.g. 28% of CDIs involve follow-up, which may disproportionately skew outcome representation based on socioeconomic status. As CDIs become increasingly prominent in medicine, we recommend that these factors are considered during development and clearly conveyed to clinicians using CDIs.
Gorenshtein, A.; Omar, M.; Glicksberg, B. S.; Nadkarni, G.; Klang, E.
Background: AI agents built on large language models (LLMs) can plan tasks, use external tools, and coordinate with other agents. Unlike standard LLMs, agents can execute multi-step processes, access real-time clinical information, and integrate multiple data sources. There has been interest in using such agents for clinical and administrative tasks; however, there is limited knowledge of their performance and whether multi-agent systems function better than a single agent for healthcare tasks. Purpose: To evaluate the performance of AI agents in healthcare, compare AI agent systems vs. standard LLMs, and catalog the tools used for task completion. Data Sources: PubMed, Web of Science, and Scopus from October 1, 2022, through August 5, 2025. Study Selection: Peer-reviewed studies implementing AI agents for clinical tasks with quantitative performance comparisons. Data Extraction: Two reviewers (A.G., M.O.) independently extracted data on architectures, performance metrics, and clinical applications. Discrepancies were resolved by discussion, with a third reviewer (E.K.) consulted when consensus could not be reached. Data Synthesis: Twenty studies met inclusion criteria. Across studies, all agent systems outperformed their baseline LLMs in accuracy. Improvements ranged from small gains to increases of over 60 percentage points, with a median improvement of 53 percentage points in single-agent tool-calling studies. These systems were particularly effective for discrete tasks such as medication dosing and evidence retrieval. Multi-agent systems showed optimal performance with up to 5 agents, and their effectiveness was particularly pronounced for highly complex tasks. The highest performance boost occurred when the complexity of the AI agent framework aligned with that of the task. Limitations: Heterogeneous outcomes precluded quantitative meta-analysis. Several studies relied on synthetic data, limiting generalizability. Conclusions: AI agents consistently improve clinical task performance over base LLMs when architecture matches task complexity. Our analysis indicates a step-change over base LLMs, with AI agents opening previously inaccessible domains. Future efforts should be based on prospective, multi-center trials using real-world data to determine safety, task-matched performance, and cost-effectiveness. Primary Funding Source: This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Registration: PROSPERO CRD420251120318
Okazaki, M.
Background: Conventional statistical methods in medical research often fail to capture real-world complexity due to rigid parametric assumptions, particularly normality, which frequently do not hold for clinical and epidemiological data. Heterogeneous distributions, heavy-tailed patterns, and multimodal structures are common in healthcare data, yet conventional methods often fail to capture these structural characteristics, leading to information loss and potentially misleading conclusions. Furthermore, regulatory audits and reproducibility requirements demand transparent, traceable analytical frameworks. Objective: This study presents a comprehensive Distribution Structure Analysis (DSA) algorithm with an integrated audit-ready framework designed specifically for medical research. The algorithm systematically identifies distributional structures, ensures statistical rigor through explicit estimand specification and goodness-of-fit testing, and maintains complete audit trails for regulatory compliance. Methods: The DSA algorithm integrates five key components: (1) explicit estimand specification aligned with research design, (2) automated distribution type identification (normal, log-normal, exponential, Weibull, power-law, and mixture models), (3) comprehensive goodness-of-fit assessment using multiple criteria (AIC/BIC, visual diagnostics, and statistical tests), (4) causal inference support through Directed Acyclic Graphs (DAGs), and (5) automated audit logging with a three-tier quality control system (red/yellow/green). The algorithm was validated using both simulated datasets with known distributions and real-world medical data from clinical trials and epidemiological studies. Results: Validation studies demonstrated that the DSA algorithm correctly identified distribution types with 95% accuracy across 1,000 simulated datasets. In clinical trial data analysis, the algorithm detected heavy-tailed distributions in adverse event frequencies that were missed by conventional normality-based methods, leading to more accurate safety assessments. The audit logging system successfully recorded all analytical decisions, enabling complete reproducibility. The three-tier quality control system flagged 12% of analyses for re-examination, preventing potential methodological errors. Application to epidemiological data revealed multimodal patterns in disease incidence that informed targeted public health interventions. Conclusions: The DSA algorithm with an integrated audit-ready framework provides a rigorous, transparent, and reproducible approach to distribution structure analysis in medical research. By explicitly addressing estimands, ensuring goodness-of-fit, and maintaining complete audit trails, the framework meets both statistical rigor and regulatory compliance requirements. The algorithm is applicable across diverse medical research domains, including clinical trials, epidemiology, health economics, and pharmacovigilance. Open-source implementation and comprehensive documentation facilitate adoption and validation by the research community.
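A minimal sketch of the distribution-identification step described in component (2): fit several candidate families by maximum likelihood with scipy and rank them by AIC. The candidate set, the fixed-location choices, and the toy data are assumptions, not the DSA implementation.

```python
import numpy as np
from scipy import stats

# Candidate families; loc is pinned at 0 for the positive-support families so
# the maximum-likelihood fit stays stable on strictly positive data.
CANDIDATES = {"norm": {}, "lognorm": {"floc": 0},
              "expon": {"floc": 0}, "weibull_min": {"floc": 0}}

def rank_by_aic(x):
    """Fit each candidate distribution by maximum likelihood and rank by AIC."""
    ranked = []
    for name, fixed in CANDIDATES.items():
        dist = getattr(stats, name)
        params = dist.fit(x, **fixed)
        k = len(params) - len(fixed)                 # count free parameters only
        loglik = np.sum(dist.logpdf(x, *params))
        ranked.append((name, 2 * k - 2 * loglik))
    return sorted(ranked, key=lambda t: t[1])

# Heavy-tailed toy data: the log-normal fit should beat the normal fit on AIC
x = np.random.default_rng(4).lognormal(mean=1.0, sigma=0.8, size=500)
for name, aic in rank_by_aic(x):
    print(f"{name:12s} AIC={aic:8.1f}")
```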
Birgmeier, J.; Steinberg, E.; Bodle, E. E.; Deisseroth, C. A.; Jagadeesh, K. A.; Kohler, J. N.; Bonner, D.; Marwaha, S.; Martinez-Agosto, J. A.; Nelson, S.; Palmer, C. G.; Cogan, J. D.; Hamid, R.; Stoler, J. M.; Krier, J. B.; Rosenfeld, J. A.; Moretti, P.; Adams, D. R.; Shashi, V.; Worthey, E. A.; Eng, C. M.; Ashley, E. A.; Wheeler, M. T.; Undiagnosed Diseases Network; Stenson, P. D.; Cooper, D. N.; Bernstein, J. A.; Bejerano, G.
Background: Many thousands of patients with a suspected Mendelian disease have their exomes/genomes sequenced every year, but only about 30% receive a definitive diagnosis. Since a novel Mendelian gene-disease association is published on average every business day, thousands of undiagnosed patient cases could receive a diagnosis each year if their genomes were regularly compared to the latest literature. With millions of genomes expected to be sequenced for rare disease analysis by 2025, and considering the current publication rate of 1.1 million new articles per annum in PubMed, manually reanalyzing the growing number of undiagnosed patient cases is not sustainable. Methods: We describe a fully automated reanalysis framework for patients with suspected, but undiagnosed, Mendelian disorders. The presented framework was tested by automatically parsing all ~100,000 newly published peer-reviewed papers every month and matching them on genotype and phenotype with all stored undiagnosed patients. If a new article contains a possible diagnosis for an undiagnosed patient, the system provides notification. We test the accuracy of the automatic reanalysis system on 110 patients, including 61 with available trio data. Results: Even when trained only on older data, our system identifies 80% of reanalysis diagnoses while sending only 0.5-1 alerts per patient per year, a 100-1,000-fold efficiency gain over manual literature surveillance of equivalent yield. Conclusion: We show that automatic reanalysis of patients with suspected Mendelian disease is feasible and has the potential to greatly streamline diagnosis. Our system is not intended to replace clinical judgment. Rather, clinical diagnostic services could greatly benefit from a modest re-allocation of time from manual literature exploration to review of automated reanalysis alerts. Our system additionally supports a new paradigm for medical IT systems: proactive, continuously learning, and consequently able to autonomously identify valuable insights as they emerge in digital health records. We have launched automated patient reanalysis, trained on the latest data, with user accounts and daily literature updates at https://AMELIE.stanford.edu.
Kleinlein, R.; Gray, K. J.; Bates, D.; Kovacheva, V. P.
Objective: Electronic health records (EHRs) contain valuable information for clinical research and decision-making. However, leveraging these data remains challenging due to data heterogeneity, inconsistent documentation, missing information, and evolving terminology, especially within unstructured clinical notes. We developed SPELL (Snippet-Primed rEgex LLM Pipeline), a scalable natural language processing (NLP) workflow to systematically extract structured clinical insights from large volumes of clinical narratives. Materials and Methods: Our platform employs a hybrid approach combining regular expressions (regex), to rapidly identify relevant textual snippets, with locally hosted large language models (LLMs) for accurate clinical interpretation. All data processing occurs securely within institutional computational environments. The modular Python-based workflow facilitates adaptation across institutions and is optimized for computational efficiency, supporting high-throughput processing even in resource-limited settings. We quantified computational scalability (elapsed time, out-of-memory events, GPU temperature, and energy consumed) and audited retrieval recall using clinician-annotated regex-negative notes enriched with relevant structured metadata. Results: The pipeline efficiently processed 31 million clinical reports spanning 1976-2024 from eight affiliated hospitals. By analyzing targeted snippets rather than entire documents, our approach reduced processing time by 68% compared to traditional full-document LLM inference, and by >95% compared to manual physician annotation. Accuracy was rigorously validated across three obstetric tasks: extraction of numerical values (blood loss volumes), dates (estimated due dates), and diagnoses (hemolysis, elevated liver enzymes, and low platelets [HELLP] syndrome). Task-level performance included 94-98% exact-match accuracy for the three concepts on curated snippets. Generalizability was investigated using the publicly available MT Samples corpus (5,013 notes, 40 specialties), yielding 84% accuracy for ventricular tachycardia detection with markedly fewer false positives. Discussion and Conclusions: A hybrid regex→snippet→LLM approach delivers accurate, privacy-preserving, and computationally efficient extraction for unstructured EHR data. By focusing inference on snippets and deploying local, open-weights models, SPELL aligns with institutional data governance requirements while enabling scalable clinical informatics studies across diverse extraction tasks. Summary Statement: We developed SPELL, a scalable NLP pipeline combining regex and locally hosted LLMs for efficient information extraction from clinical narratives.
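A minimal sketch of the regex→snippet→LLM pattern: a cheap regex pass selects short windows of text, and only those snippets are sent to a locally hosted model. The pattern, window size, prompt, and the stand-in ask_local_llm function are assumptions; the stub answers with a trivial regex so the sketch runs without any model dependency.

```python
import re

# Step 1: a cheap regex pass finds candidate mentions (pattern is illustrative)
EBL_RE = re.compile(r"estimated blood loss|EBL", re.IGNORECASE)

def extract_snippets(note: str, window: int = 200) -> list[str]:
    """Return short text windows around each regex hit instead of the whole note."""
    return [note[max(0, m.start() - window): m.end() + window]
            for m in EBL_RE.finditer(note)]

def ask_local_llm(prompt: str) -> str:
    """Stand-in for a locally hosted LLM call; a trivial regex answers here so
    the sketch runs end-to-end without any model dependency."""
    m = re.search(r"(\d+)\s*(?:mL|cc)", prompt, re.IGNORECASE)
    return f"{m.group(1)} mL" if m else "none stated"

def extract_blood_loss(note: str):
    # Step 2: only the snippets, not the full document, go to the model
    for snippet in extract_snippets(note):
        answer = ask_local_llm(
            "Report the estimated blood loss in mL, or say 'none stated':\n" + snippet)
        if answer.lower() != "none stated":
            return answer
    return None

note = "Uncomplicated delivery. Estimated blood loss 350 mL. Mother stable."
print(extract_blood_loss(note))   # '350 mL'
```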
Fukataki, Y.; Hayashi, W.; Kitayama, M.; Ito, Y. M.
Retrieval-augmented generation (RAG) holds promise for supporting high-stakes medical decision-making. However, most research has focused on downstream optimization of parameters and algorithms. This Phase 1 foundational study quantitatively evaluated the upstream quality of knowledge documents and their impact on retrieval performance, using Japanese clinical research protocol manuals for Institutional Review Board pre-screening support as a case study. We established a three-tier evaluation framework: Level 1 assessed knowledge document quality through independent expert review across Structure, Granularity, and Noise dimensions; Level 2a evaluated the structural quality of retrieved chunks using large language model-as-a-Judge across five metrics; and Level 2b conducted proof-of-concept content appropriateness evaluation against a Gold Standard derived from international guidelines. Using Google Cloud Vertex AI Search, we analyzed 594 chunks from baseline knowledge (A-line: four institutional manuals as-is) and six chunks from optimized knowledge (B-line: proof of concept). Level 2a evaluations employed deterministic settings with five independent trials, achieving excellent reliability (intraclass correlation coefficient of 0.936). The results revealed substantial quality limitations in the A-line chunks: the median scores were 2.0 or below across all five metrics, with fewer than 20% of the chunks reaching practical utility thresholds (score of 4 or higher). Even among the top-ranked results, fewer than half met the practical utility criteria, except for Faithfulness. The inter-rater agreement in the Level 1 evaluation was fair (Fleiss kappa value of 0.269), indicating the need for framework refinement. The retrieved chunk lengths significantly exceeded the configured settings (median of 3,861 characters versus 500 tokens), potentially indicating information dilution. The B-line optimization achieved perfect scores across all metrics, demonstrating potential for improvement. These findings demonstrate that upstream document quality constrains retrieval performance, challenging assumptions regarding plug-and-play RAG deployment. Author Summary: Artificial intelligence systems that retrieve information from documents and generate responses are increasingly being used to support medical decision-making. We questioned the assumption that uploading existing documents is sufficient for fine-tuning the algorithms of these systems by investigating whether document quality is a limiting factor. We studied Japanese clinical research manuals used for research ethics review, assessing the efficacy of an AI system in retrieving information from these documents. We evaluated nearly 600 text segments retrieved by the system and found that fewer than one in five segments met our quality standards, even among the highest-ranked results. The system frequently retrieved excessively long passages obscuring key information. However, when we restructured one document section using clearer organization and formatting, the system achieved perfect performance scores. This improvement suggests that not only algorithm optimization but also document preparation is crucial for system effectiveness. Our findings challenge the "plug-and-play" assumption commonly used in AI deployment. For high-stakes medical applications, organizations cannot simply expect reliable results to be obtained by uploading existing documents. Instead, they must invest in preparing well-structured knowledge documents.
This foundational work establishes measurement methods to guide such preparation, which is essential before these systems can safely support healthcare decision-making.
Moreira Melo, P. H.; Poenaru, D.; Guadagno, E.
Background: Systematic reviews (SRs) are essential for evidence-based medicine but require extensive time and resources for abstract screening. Large language models (LLMs) offer potential for automating this process, yet concerns about data privacy, intellectual property protection, and reproducibility limit the use of cloud-based solutions in research settings. Objective: To evaluate the performance of a locally deployed 20-billion-parameter LLM for automated abstract screening in systematic reviews using a sensitivity-enhanced prompting strategy, with blind expert adjudication of all discordant human-AI cases. Methods: We deployed GPT-OSS:20B locally using Ollama and evaluated its performance across three systematic reviews: AI applications in pediatric surgical pathology (n=3,350), LLM applications in electronic health records (n=4,326), and parental stress/caregiver burden in surgically treated children (n=8,970). A sensitivity-enhanced prompting strategy instructing the model to include abstracts when uncertain was employed. All discordant cases underwent blind expert adjudication. Results: Across 16,646 abstracts, the LLM demonstrated variable sensitivity after expert adjudication: 100% in SR1, 95.7% in SR2, and 85.7% in SR3. Expert adjudication identified 11 human screening errors across all reviews that the LLM had correctly classified. The LLM completed screening 4.7 times faster than human reviewers. Conclusions: A locally deployed LLM with sensitivity-enhanced prompting shows promising performance for systematic review abstract screening, particularly for technology-focused topics. Performance variability across domains suggests that screening accuracy depends partly on the objectivity of inclusion criteria. We recommend deploying LLMs as second screeners alongside human reviewers until performance is more fully validated across diverse domains.
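A minimal sketch of a sensitivity-enhanced screening prompt of the kind described, defaulting to inclusion when the model is uncertain. The prompt wording, function names, and the stub generate callable are assumptions, not the study's protocol.

```python
SCREEN_PROMPT = """You are screening abstracts for a systematic review.
Inclusion criteria:
{criteria}

Abstract:
{abstract}

If you are at all uncertain, answer INCLUDE (uncertain records go to a human reviewer).
Answer with exactly one word: INCLUDE or EXCLUDE."""

def screen_abstract(abstract: str, criteria: str, generate) -> bool:
    """Return True when the record should be kept for full-text review.

    `generate` is any callable that sends a prompt to a locally hosted model
    (e.g. a thin wrapper around an Ollama HTTP call) and returns its reply."""
    reply = generate(SCREEN_PROMPT.format(criteria=criteria, abstract=abstract))
    return "EXCLUDE" not in reply.upper()   # default to inclusion on ambiguity

# Toy run with a stub "model" that always hedges toward inclusion
keep = screen_abstract("Deep learning for EHR phenotyping ...",
                       "Studies applying LLMs to electronic health records",
                       generate=lambda prompt: "INCLUDE")
print(keep)   # True
```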